This report presents a theoretical study of the transmission of information
in the case of discrete messages and noiseless systems. The study
begins with the definition of a unit of information (a selection between
two choices equally likely to be selected), and this is then used to determine
the amount of information conveyed by the selection of one of an
arbitrary number of choices equally likely to be selected. Next, the average
amount of information per selection is computed in the case of messages
consisting of sequences of independent selections from an arbitrary number of
choices with arbitrary probabilities of their being selected. A recoding
procedure is also presented for improving the efficiency of transmission by
reducing, on the average, the number of selections (digits or pulses)
required to transmit a message of given length and given statistical character.
The results obtained in the case of sequences of independent selections are
extended later to the general case of non-independent selections. Finally,
the optimum condition is determined for the transmission of information by
means of quantized pulses when the average power is fixed.


THE TRANSMISSION OF INFORMATION



It is the opinion of many workers in the field of electrical communications
that the communication art is today at a major turning point of its
development. The objective of almost all electrical communication systems
has been, up to now, to eliminate distance in some form of human activity or
relationships between men. Telegraph, telephone and television are typical
examples of such communication systems. We may add to these teletype,
telecontrol and telemetering. It is interesting to note that the names of all
these communication systems involve the prefix "tele", meaning "at a distance".

Although, for obvious reasons, forms of communication over distances
much greater than the ranges of human senses and reach were first to receive
attention, the magnitude of the distance involved is not of primary importance
from a logical point of view in the concept of communication.
Communication is basically any form of transmission of information, regardless
of the distance between the transmitter and the receiver. In a broader
sense, the field of communication includes any handling, combining, comparing
or employing of information, since such processes involve and are intimately
connected with the transmission of such information.

It is clear, then, that most human activities involve communication in
a broad sense, and, in particular, those activities which are considered of 
higher intellectual type because they depend to a high degree on the process 
of "thinking". Thinking itself, in fact, involves a natural communication 
system of a complexity far beyond that conceivable for any man-made system. 

The above considerations point clearly to a very wide field of useful 
applications of the communication art which has hardly been touched as yet. 

It is to be expected that each application should present problems of a
higher order of complexity than those encountered in the past. Consequently,
it is also to be expected that the solution of these problems should
necessitate the use of more powerful analytical tools and, particularly, should
require a more fundamental study of the process of transmission of
information. As a matter of fact, the first and most significant step in the
direction of such a study was made by Norbert Wiener (1) in connection with
the development of predictors for antiaircraft fire control. The statistical
nature of this problem led him to the realization that all communication
problems are fundamentally of a statistical nature, and must be handled
accordingly. He argued that the signal to be transmitted in a communication
system can never be considered as a known function of time, because if it
were a priori known it could not convey any new information and therefore
would not need to be transmitted. On the other hand, what can be known
a priori about a signal to be transmitted is its statistical character,
that is, for instance, the probability distribution of its amplitude. In
addition, it is equally clear that noise, which plays such an important
part in communication problems, can be described only in statistical terms.

It follows that all communication problems are inherently statistical in
nature, and that disregarding this fact may lead to unexplainable
inconsistencies in addition to precluding a deeper understanding of such problems.

The statistical theory of optimum prediction and filtering developed
by Wiener led further to the realization of the need for a basic and general
criterion for judging the quality of communication systems. In fact, the
mean-square error criterion used by Wiener in this part of his work is
dictated by mathematical convenience rather than by physical considerations;
consequently it may not be useful in certain practical problems. The search
for a more appropriate criterion leads naturally to the question of what is
the operation that a communication system must perform. If we take as an
example a telegraph system, it might seem at first obvious that such a system
must reproduce at the output each and every letter of the input message in
the proper order. We may observe, however, that if one letter is received
incorrectly, the word containing it is still perfectly understandable in
most cases, and so, of course, is the whole message. Moreover, the message
would still be comprehensible if, for instance, all the vowels were
eliminated (which is what is done in written Hebrew). On the other hand, the
incorrect transmission of a digit in a number would make the received
message incorrect.

It appears therefore that the transmission of the information conveyed
by a written message is what we wish to obtain and that this is not
necessarily equivalent to the transmission of all the letters contained in the
written message. More precisely, it appears that the different symbols,
letters or figures contained in a written message do not contribute equally
to the transmission of information, so much so that some of them may be
completely unnecessary. Similar conclusions are reached by considering
other types of communication systems. In particular, the recent work on
the Vocoder (2) and the clipping of speech waves (3) has provided
considerable evidence in the same general direction.

The above considerations are relevant to another problem with which
communication engineers are becoming more and more concerned, namely, that
of bandwidth reduction. As a matter of fact, the Vocoder was developed
primarily for the purpose of reducing the bandwidth required for speech
transmission. It is clear that if different parts of a message are not
equally important, some saving in bandwidth might be possible by providing
transmission facilities which are proportional to the importance of these
different parts. The bandwidth problem, in turn, is intimately connected
with the noise-reduction problem. In fact, all the different types of
modulation developed for the purpose of noise and interference reduction
require a bandwidth wider than that required by amplitude modulation. This
method of paying for an improved signal-to-noise ratio with an increased
bandwidth appears to be the result of some fundamental limitation which,
however, the conventional approach to communication problems has failed to
clarify.

The above discussion of some of the problems confronting or likely to
confront the communication engineer indicates clearly the necessity of
providing a measure for the "thing" which is to be transmitted and which has
been vaguely called "information". Such a measure will then permit a
quantitative and more fundamental study of the process involved in the
transmission of information which, in turn, will lead eventually to the design
of better and more efficient communication devices. A considerable amount
of work in this direction has already been done independently by Norbert
Wiener (4) and Claude Shannon (5). The work of Wiener is particularly
outstanding because of its philosophical profoundness and its importance in
many branches of science other than communication engineering. Mention
should be made also of the pioneering work of Hartley (6) and of the more
recent work of Tuller (7).

This paper presents the work done by the author in the past year on
the transmission of discrete signals through a noiseless channel. Although
most of the results obtained have already been published by Wiener and
Shannon, it is felt that the method of approach used here is sufficiently
different to justify this redundant presentation.


I. Definition of the Unit of Information 

In order to define, in an appropriate and useful manner, a unit of
information, we must first consider in some detail the nature of those
processes in our experience which are generally recognized as conveying
information. A very simple example of such processes is a yes-or-no answer
to some specific question. A slightly more involved process is the
indication of one object in a group of N objects, and, in general, the selection
of one choice from a group of N specific choices. The word "specific" is
underlined because such a qualification appears to be essential to these
information-conveying processes. It means that the receiver is conscious
of all possible choices, as is, of course, the transmitter (that is, the
individual or the machine which is supplying the information). For instance,
saying "yes" or "no" to a person who has not asked a question obviously does
not convey any information. Similarly, the reception of a code number which
is supposed to represent a particular message does not convey any
information unless there is available a code book containing all the messages with
the corresponding code numbers.

Considering next more complex processes, such as writing or speaking,
we observe that these processes consist of orderly sequences of selections
from a number of specific choices, namely, the letters of the alphabet or
the corresponding sounds. Furthermore, there are indications that the
signals transmitted by the nervous system are of a discrete rather than of a
continuous nature, and might also be considered as sequences of selections.
If this were the case, all information received through the senses could be
analyzed in terms of selections. The above discussion indicates that the
operation of selection forms the basis of a number of processes recognized
as conveying information, and that it is likely to be of fundamental
importance in all such processes. We may expect, therefore, that a unit of
information, defined in terms of a selection, will provide a useful basis
for a quantitative study of communication systems.

Considering more closely this operation of selection, we observe that
different informational value is naturally attached to the selection of the
same choice, depending on how likely the receiver considered the selection
of that particular choice to be. For example, we would say that little
information is given by the selection of a choice which the receiver was
almost sure would be selected. It seems appropriate, therefore, in order to
avoid difficulty at this early stage, to use in our definition the particular
case of equally likely choices, that is, the case in which the receiver has
no reason to expect that one choice will be selected rather than any other.
In addition, our natural concept of information indicates that the
information conveyed by a selection increases with the number of choices from which
the selection is made, although the exact functional relation between these
two quantities is not immediately clear.

On the basis of the above considerations, it seems reasonable to define
as the unit of information the simplest possible selection, namely, the
selection between two equally likely choices, called, hereafter, the
"elementary selection". For completeness, we must add to this definition the
postulate, consistent with our intuition, that N independent selections of
this type constitute N units of information. By independent selections we
mean, of course, selections which do not affect one another. We shall adopt
for this unit the convenient name of "bit" (from "binary digit"), suggested
by Shannon. We shall also refer to a selection between two choices (not
necessarily equally likely) as a "binary selection", and to a selection from
N choices as an "N-order selection". When the choices are, a priori, equally
likely, we shall refer to the selection as an "equally likely selection".
We can now proceed to develop ways of measuring the information content of
discrete messages in terms of the unit just defined. Most of this paper
will be devoted to the solution of this problem.

II. Selection from N Equally Likely Choices

Consider now the selection of one among a number, N, of equally likely
choices. In order to determine the amount of information corresponding to
such a selection, we must reduce this more complex operation to a series of
independent elementary selections. The required number of these elementary
selections will be, by definition, the measure in bits of the information
given by such an N-order selection.

Let us assume for the moment that N is a power of two. In addition
(just to make the operation of selection more physical), let us think of
the N choices as N objects arranged in a row, as indicated in Figure 1.

[Fig. 1 Selection procedure for equally likely choices. Column heading:
Binary Number.]


These N objects are first divided in two equal groups, so that the object
to be selected is just as likely to be in one group as in the other. Then
the indication of the group containing the desired object is equivalent to
one elementary selection, and, therefore, to one bit. The next step
consists of dividing each group into two equal subgroups, so that the object
to be selected is again just as likely to be in either subgroup. Then one
additional elementary selection, that is a total of two elementary
selections, will suffice to indicate the desired subgroup (of the possible four
subgroups). This process of successive subdivisions and corresponding
elementary selections is carried out until the desired object is isolated from
the others. Two subdivisions are required for N = 4, three for N = 8, and,
in general, a number of subdivisions equal to $\log_2 N$, in the case of an
N-order selection.
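
The subdivision procedure can be sketched in a few lines of Python. This is
only an illustrative sketch, assuming N is a power of two and the objects are
numbered 0 to N-1 from left to right; the function name `halving_selections`
is hypothetical and does not appear in the original report.

```python
def halving_selections(num_choices, target):
    """Return the elementary selections (0 = first half, 1 = second half)
    that isolate object `target` among `num_choices` objects in a row.

    Assumption of this sketch: num_choices is a power of two.
    """
    if num_choices < 1 or num_choices & (num_choices - 1):
        raise ValueError("num_choices must be a power of two")
    if not 0 <= target < num_choices:
        raise ValueError("target must lie between 0 and num_choices - 1")

    low, high = 0, num_choices      # current group holds objects low .. high-1
    selections = []
    while high - low > 1:
        middle = (low + high) // 2  # divide the group in two equal subgroups
        if target < middle:
            selections.append(0)
            high = middle
        else:
            selections.append(1)
            low = middle
    return selections


for n_choices in (4, 8, 16):
    lengths = {len(halving_selections(n_choices, t)) for t in range(n_choices)}
    print(n_choices, "choices ->", lengths, "elementary selections")
```

Each returned list holds one elementary selection per subdivision, so its
length is the number of bits conveyed by the N-order selection.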

The same process can be carried out in a purely mathematical form by
assigning order numbers from 0 to N-1 to the N choices. The numbers are
then expressed in the binary system, as shown in Figure 1, the number of
binary digits (0 or 1) required being equal to $\log_2 N$. These digits represent
an equal number of elementary selections and, moreover, correspond in order
to the successive divisions mentioned above. In conclusion, an N-order,
equally likely selection conveys an amount of information

$$H_N = \log_2 N. \tag{1}$$


The above result is strictly correct only if N is a power of two, in
which case $H_N$ is an integer. If N is not a power of two, then the number of
elementary selections required to specify the desired choice will be equal
to the logarithm of either the next lower or the next higher power of two,
depending on the particular choice selected. Consider, for instance, the
case of N = 3. The three choices, expressed as binary numbers, are then

00, 01, 10.

If the binary digits are read in order from left to right, it is clear
that the first two numbers require two binary selections, that is, two
digits, while the third number requires only the first digit, 1, in order to
be distinguished from the other two. In other words, the number of
elementary selections required when N is not a power of two is equal to either one
of the two integers closest to $\log_2 N$. It follows that the corresponding
amount of information must lie between these two limits, although the
significance of a non-integral value of $H_N$ is not clear at this point. It will
be shown in the next section that Eq.(1) is still correct when N is not a
power of two, provided $H_N$ is considered as an average value over a large
number of selections.
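
The case N = 3 can be checked with a short Python sketch. It assumes that the
third binary number is cut after its first digit, as described above; the
variable names are illustrative only.

```python
from math import ceil, floor, log2

# Code words for the three choices, read from left to right.
# The third number, 10, is cut after its first digit.
code_words = {0: "00", 1: "01", 2: "1"}

lengths = [len(word) for word in code_words.values()]
lower, upper = floor(log2(3)), ceil(log2(3))   # the two integers closest to log2 3

# No code word may be the beginning of another one, otherwise the
# receiver could not tell where a selection ends.
prefix_free = all(
    not b.startswith(a)
    for a in code_words.values()
    for b in code_words.values()
    if a != b
)

average = sum(lengths) / len(lengths)          # the three choices are equally likely
print("lengths:", lengths, "limits:", (lower, upper))
print("average digits:", round(average, 3), "log2 3:", round(log2(3), 3))
```

For a single selection the average number of digits stays above $\log_2 3$,
which is why the average has to be taken over a long sequence of selections.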


III. Messages and Average Amount of Information 

We have determined in the preceding section the amount of information
conveyed by a single selection from N equally likely choices. In general,
however, we have to deal with not one but long series of such selections,
which we call messages. This is the case, for instance, in the transmission
of written intelligence. Another example is provided by the communication
system known as pulse-code modulation, in which audio waves are sampled at
equal time intervals and then each sample is quantized, that is, approximated
by the closest of a number N of amplitude levels.




Let us consider, then, a message consisting of a sequence of n successive
N-order selections. We shall assume, at first, that these selections
are independent and equally likely. In this simpler case, all the different
sequences which can be formed, equal in number to

$$S = N^n, \tag{2}$$

are equally likely to occur. For instance, in the case of N = 2 (the two
choices being represented by the numbers 0 and 1) and n = 3, the possible
sequences would be 000, 001, 010, 100, 011, 101, 110, 111. The total number
of these sequences is S = 8 and the probability of each sequence is 1/8.
In general, therefore, the ensemble of the possible sequences may be
considered as forming a set of S equally likely choices, with the result that
the selection of any particular sequence yields an amount of information

$$H_S = \log_2 S = n \log_2 N. \tag{3}$$

In words, n independent equally likely selections give n times as much
information as a single selection of the same type. This result is certainly
not surprising, since it is just a generalization of the postulate, stated
in Section I, which forms an integral part of the definition of information.
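
The counting argument can be verified by direct enumeration. The following
Python sketch lists all sequences of n independent N-order selections; the
function name `sequence_information` is an assumption made for illustration.

```python
from itertools import product
from math import log2


def sequence_information(order, length):
    """Enumerate all sequences of `length` selections from `order` choices
    and return (sequences, their number S, log2 S)."""
    sequences = ["".join(str(c) for c in s)
                 for s in product(range(order), repeat=length)]
    count = len(sequences)
    return sequences, count, log2(count)


sequences, S, H_S = sequence_information(order=2, length=3)
print(sequences)                       # the 8 binary sequences of 3 digits
print("S =", S, " H_S =", H_S, "bits")
print("n log2 N =", 3 * log2(2))
```

The enumeration order differs from the listing in the text, but the number of
sequences and the information per sequence are the same.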

It is often more convenient, in dealing with long messages, to use a
quantity representing the average amount of information per N-order selection,
rather than the total information corresponding to the whole message. We
define this quantity in the most general case as the total information
conveyed by a very long message divided by the number of selections in the
message, and we shall indicate it with the symbol $H_N$, where N is the order
of each selection. It is clear that when all the selections in the message
are equally likely and independent and, in addition, N is a power of two,
this quantity is just the amount of information conveyed by each
selection, that is

$$H_N = \log_2 N. \tag{4}$$

We shall show now that this equation is correct also when N is not a power
of two, in which case $H_N$ has to be actually an average value taken over a
sufficiently long sequence of selections.

The number S of different and equally likely sequences which can be
formed with n independent and equally likely selections is still given by
Eq.(2), even when N is not a power of two. On the contrary, the number of
elementary selections required to specify any one particular sequence must
be written now in the form

$$B_S = \log_2 S + d, \tag{5}$$

where d is a number, smaller in magnitude than unity, which makes $B_S$ an
integer and which depends on the particular sequence selected. The average
amount of information per N-order selection is then, by definition,

$$H_N = \lim_{n \to \infty} \frac{1}{n} \left( \log_2 S + d \right). \tag{6}$$

Since N is a constant and since the magnitude of d is smaller than unity,
Eq.(6), together with Eq.(2), yields

$$H_N = \log_2 N. \tag{7}$$

(The author is indebted to Mr. T. P. Cheatham, Jr., of this Laboratory, for the
original idea on which are based both this proof and the corresponding recoding
procedure; see Section IV.)
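
The limiting process can be illustrated numerically. The Python sketch below
is a simplification of the argument above: it gives every block of n
selections a binary number of the same length (the smallest integer not less
than $\log_2 S$), whereas in the text the number of digits may vary from
sequence to sequence. The function name `digits_per_selection` is hypothetical.

```python
from math import log2


def digits_per_selection(order, block_length):
    """Binary digits per N-order selection when every block of
    `block_length` selections gets a binary number of fixed length."""
    sequences = order ** block_length        # S = N^n
    digits = (sequences - 1).bit_length()    # smallest integer >= log2 S
    return digits / block_length


for n in (1, 2, 5, 10, 100, 1000):
    print(n, digits_per_selection(3, n), "limit:", round(log2(3), 4))
```

Because the excess over $\log_2 S$ is less than one digit per block, the excess
per selection is less than 1/n and vanishes as the blocks grow longer.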


We shall consider now the more complex case in which the selections,
although still independent, are not equally likely. In this case, too, we
wish to compute the average amount of information per selection. For this
purpose, we consider again the ensemble of all the messages consisting of
n independent selections and we look for a way of indicating any one
particular message by means of elementary selections. If we were to proceed as
before, and divide the ensemble of messages in two equal groups, the
selection of the group containing the desired message would no longer be a
selection between equally likely choices, since the sequences themselves
are not equally likely. The proper procedure is now, of course, to make
equal for each group not the number of messages in it but the probability
of its containing the desired message. Then the selection of the desired
group will be a selection between equally likely choices. This procedure
of division and selection is repeated over and over again until the desired
message has been separated from the others. The successive selections of
groups and subgroups will then form a sequence of independent elementary
selections.

One may observe, however, that it will not generally be possible to
form groups equally likely to contain the desired message, because shifting
any one of the messages from one group to the other will change, by finite
amounts, the probabilities corresponding to the two groups. On the other
hand, if the length of the messages is increased indefinitely, the accuracy
with which the probabilities of the two groups can be made equal becomes
better and better, since the probability of each individual message approaches
zero. Even so, when the resulting subgroups include only a few messages
after a large number of divisions, it may become impossible to keep the
probabilities of such subgroups as closely equal as desired unless we
proceed from the beginning in an appropriate manner, as indicated below. The
messages are first arranged in order of their probabilities, which can be
easily computed if the probabilities of the choices are known. The divisions
in groups and subgroups are then made successively without changing the order
of the messages, as illustrated in Figure 2. In this manner, the smaller
subgroups will contain messages with equal or almost equal probabilities, so
that further subdivisions can be performed satisfactorily.

[Fig. 2 Probabilities of groups obtained by successive divisions (Div. I to
Div. VI), and the resulting new code, for messages consisting of 2 third-order
selections with p(0) = 0.7, p(1) = 0.2, p(2) = 0.1. Legible code words:
100, 1101, 1110, 11110, 111110.]

It is clear that when the above procedure is followed, the number of
binary selections required to separate any message from the others varies
from message to message. Messages with a high probability of being selected
require fewer binary selections than those with lower probabilities. This
fact is in agreement with the intuitive notion that the selection of a
little-probable message conveys more information than the selection of a
more-probable one. Certainly, the occurrence of an event which we know
a priori to have a 99 per cent probability is hardly surprising or, in our
terminology, yields very little information, while the occurrence of an
event which has a probability of only 1 per cent yields considerably more
information. More precisely, as shown below, if P(i) is the probability
of the ith message, the number of binary selections required to indicate
this message will be an integer $B_S(i)$ close to $-\log_2 P(i)$. In fact, P(i)
is just the probability of the last subgroup obtained by successively
halving (approximately) the probability of the whole ensemble of messages
(which is unity) a number of times equal to $B_S(i)$, so that
$P(i) \approx 2^{-B_S(i)}$. By making the messages sufficiently long, that is,
the number n of N-order selections sufficiently large, the integer $B_S(i)$
can be made to differ in percentage from $-\log_2 P(i)$ by less than any
desired amount. Hence, in this limiting case, we can write

$$B_S(i) = -\log_2 P(i). \tag{8}$$

Let us consider now a sequence of M selections of messages, each message
consisting of n N-order selections (forming a sequence of nM selections).
By making the number M sufficiently large, we can be practically sure that
the ith message will appear in the sequence with a frequency as close to
P(i) as desired. Therefore the number of binary selections required on the
average to select one message, that is, "the mathematical expectation of
$B_S$", will be

$$E(B_S) = \sum_{i=0}^{S-1} P(i)\, B_S(i) = -\sum_{i=0}^{S-1} P(i) \log_2 P(i). \tag{9}$$

The average amount of information per N-order selection is then, from
Eqs.(8) and (9),

$$H_N = \lim_{n \to \infty} \frac{E(B_S)}{n} = -\lim_{n \to \infty} \frac{1}{n} \sum_{i=0}^{S-1} P(i) \log_2 P(i), \tag{10}$$

that is, the limit of the ratio of the number of binary selections required,
on the average, to select one message to the number of N-order selections
in the message.
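
The division procedure and the resulting average number of binary selections
per message can be sketched as follows. This is a minimal sketch, not the
author's own formulation: the function name `divide_and_code`, the
tie-breaking rule (the first cut giving the most nearly equal group
probabilities) and the use of exact fractions are assumptions, and the
probabilities 0.7, 0.2, 0.1 with n = 2 are taken from the legible part of
Figure 2.

```python
from fractions import Fraction
from itertools import product
from math import log2, prod


def divide_and_code(items):
    """items: list of (message, probability) in order of decreasing
    probability. Returns a dict mapping each message to its binary code."""
    codes = {message: "" for message, _ in items}

    def divide(group):
        if len(group) < 2:
            return
        total = sum(p for _, p in group)
        running, best_cut, best_diff = 0, 1, None
        for cut in range(1, len(group)):
            running += group[cut - 1][1]
            diff = abs(2 * running - total)   # |P(first group) - P(second group)|
            if best_diff is None or diff < best_diff:
                best_cut, best_diff = cut, diff
        for message, _ in group[:best_cut]:
            codes[message] += "0"
        for message, _ in group[best_cut:]:
            codes[message] += "1"
        divide(group[:best_cut])              # order of the messages is kept
        divide(group[best_cut:])

    divide(items)
    return codes


p = [Fraction(7, 10), Fraction(2, 10), Fraction(1, 10)]  # from Figure 2
n = 2                                                     # selections per message
messages = [(m, prod(p[k] for k in m))
            for m in product(range(len(p)), repeat=n)]
messages.sort(key=lambda item: item[1], reverse=True)     # stable sort

codes = divide_and_code(messages)
average_digits = sum(prob * len(codes[m]) for m, prob in messages)
entropy_per_selection = -sum(float(q) * log2(float(q)) for q in p)

for m, prob in messages:
    print(m, float(prob), codes[m])
print("digits per selection:", float(average_digits) / n)
print("-sum p log2 p       :", round(entropy_per_selection, 4))
```

Under these assumptions the sketch assigns the single digit 0 to the most
probable message, and the average of 2.33 binary digits per message (1.165 per
selection) lies slightly above the value of about 1.157 bits per selection
computed from the probabilities of the choices.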

Now let p(k) be the probability of the kth choice (of the N), and $n_{ik}$
be the number of times the kth choice is selected in the ith message
(sequence of n selections). The probability of the ith message is

$$P(i) = \prod_{k=0}^{N-1} p(k)^{n_{ik}}. \tag{11}$$

The number of binary selections required to indicate this message can be
written as

$$B_S(i) = -\sum_{k=0}^{N-1} n_{ik} \log_2 p(k) \tag{12}$$

with any degree of accuracy desired. In the limit when n approaches infinity
these binary selections become elementary selections, that is, binary
selections between equally likely choices. We must now compute $E(B_S)$ according
to Eq.(9). The number of sequences of selections, that is, messages, to
which correspond the same values of P(i) and $B_S(i)$, is equal to the number
of different permutations of the choices selected in the ith sequence; that
is, to

$$\frac{n!}{\prod_{k=0}^{N-1} n_{ik}!}.$$

It follows that the average value of $B_S(i)$ is given by

$$E(B_S) = -\sum \frac{n!}{\prod_{k=0}^{N-1} n_k!} \prod_{k=0}^{N-1} p(k)^{n_k} \sum_{k=0}^{N-1} n_k \log_2 p(k). \tag{13}$$
The overall summation in Eq.(13) is made over all possible combinations of
integral positive values of the $n_k$ whose sum is equal to n. In order to
compute the value of this summation, we begin by expressing the factorials
in Eq.(13) by means of Stirling's formula, valid for large values of n. We
obtain then, for the probability factor of Eq.(13) multiplied by $n^{N-1}$,

$$f(x) \approx \left( \frac{n}{2\pi} \right)^{(N-1)/2} \left[ \prod_{k=0}^{N-1} x_k \right]^{-1/2} \left[ \prod_{k=0}^{N-1} \left( \frac{p(k)}{x_k} \right)^{x_k} \right]^{n}.$$

The variables $x_k = n_k/n$ are always positive and satisfy

$$\sum_{k=0}^{N-1} x_k = 1. \tag{19}$$

It is convenient, at this point, to consider the $x_k$ as continuous, rather
than discrete, variables, and to transform the summation into an integral.
We observe, in this regard, that when $n_k$ varies from zero to n, $x_k$
varies from zero to one. It follows that to a unit increment of $n_k$ ($n_k$
takes only integral values) corresponds an increment of $x_k$ equal to 1/n,
that is, when n approaches infinity, to the differential $dx_k = 1/n$. In
consequence, the summation can be transformed (10) into an integral. The
expression for $H_N$ then becomes

$$H_N = -\lim_{n \to \infty} \int \cdots \int f(x) \sum_{k=0}^{N-1} x_k \log_2 p(k) \, dx_0 \cdots dx_{N-2} \tag{20}$$

over the region of the hyperplane defined by Eq.(19), in which all the $x_k$
are positive and smaller than one. It will be noted that in Eq.(20)
$x_{N-1}$ is considered as a function of all the other $x_k$, so as to limit
the integration to the above-mentioned hyperplane.

To compute the integral appearing in Eq.(20), we observe first that the
integral of f(x) alone over the same region represents the summation of the
probabilities of all possible messages consisting of n selections, provided,
of course, that n is sufficiently large. Therefore, the integral of f(x)
must be equal to unity for all large values of n. On the other hand, as
shown in Appendix I, f(x) has a peak at a point which approaches $x_k = p(k)$
when n approaches infinity. The height of this peak is proportional to
$n^{(N-1)/2}$. It follows that when n approaches infinity, f(x) becomes a
delta-function, or unit impulse, located at $x_k = p(k)$. The integral of Eq.(20)
is, therefore, equal to the value for $x_k = p(k)$ of the rest of the integrand,
that is, of the summation. Eq.(20) yields finally

$$H_N = -\sum_{k=0}^{N-1} p(k) \log_2 p(k), \tag{22}$$

which is then the average amount of information per N-order selection.

The conclusions which can be reached from the evaluation of the integral
in Eq.(20) extend far beyond Eq.(22). It is easy to see that if the function

$$-\sum_{k=0}^{N-1} x_k \log_2 p(k)$$

were any other finite function of the $x_k$, the limiting value of the integral
would still be equal to the value of the function for $x_k = p(k)$. In other
words, the expectation (or average value) of any function of the $x_k$ is equal
to the value of the function itself for $x_k = p(k)$. From a physical point of
view, we can say that the ensemble of sequences of selections can
be divided in two groups. The first group consists of sequences for which
the frequencies $x_k$ of occurrence of the different choices differ from the
probabilities p(k) of the choices by less than amounts which approach zero
as $1/\sqrt{n}$ when n approaches infinity. The total probability of the sequences
in this group approaches unity when n increases indefinitely, and therefore
the number of sequences in this group approaches

$$\nu = \frac{1}{\prod_{k=0}^{N-1} p(k)^{n\,p(k)}}. \tag{23}$$

The second group consists of all other sequences, and its total probability
approaches zero when n approaches infinity.

The sequences of the first group are all equally probable and, therefore,
the selection of one of them out of the group requires a number of
binary selections equal to

$$\log_2 \nu = n H_N. \tag{24}$$

In other words, the sequences of the first group can be represented by means
of sequences of $nH_N$ binary digits, that is, $H_N$ digits per N-order selection.
All the other sequences together, regardless of the way in which they are
represented, cannot increase by any finite amount, beyond $H_N$, the number of
binary digits required on the average per N-order selection.
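
The growth of the number of sequences in the first group can be checked
numerically. The Python sketch below is a simplification: it counts only the
sequences in which each choice occurs exactly n p(k) times (a subset of the
first group), using `math.lgamma` for the logarithms of the factorials. The
function name `typical_sequence_rate` and the probabilities 0.7, 0.2, 0.1 are
assumptions chosen for illustration.

```python
from math import lgamma, log, log2


def typical_sequence_rate(probabilities, n):
    """log2 of the number of length-n sequences in which choice k occurs
    exactly round(n * p(k)) times, divided by n."""
    counts = [round(n * p) for p in probabilities]
    if sum(counts) != n:
        raise ValueError("the rounded counts n * p(k) must add up to n")
    # ln of n! / (n_0! n_1! ...), using lgamma(m + 1) = ln(m!)
    ln_count = lgamma(n + 1) - sum(lgamma(c + 1) for c in counts)
    return ln_count / log(2) / n


p = [0.7, 0.2, 0.1]
H = -sum(q * log2(q) for q in p)          # bits per selection

for n in (100, 10_000, 1_000_000):
    print(n, round(typical_sequence_rate(p, n), 4), "limit:", round(H, 4))
```

The number of binary digits per selection needed to number these sequences
rises toward the average information per selection as n increases.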

The expression for $H_N$ obtained above indicates that $H_N$ can be considered
as the expectation of $\log_2 [1/p(k)]$. In other words, we may say that the
selection of a particular choice k conveys an amount of information equal to
the logarithm-base-two of the reciprocal of its probability. This
interpretation is fundamental. It will be shown later to apply also to the
general case of non-independent selections, in which case p(k) will be
replaced by the conditional probability that the kth choice will be
selected, based on the knowledge of all preceding selections.

It is easy to see from Eq.(22) that Hj. vanishes only when all but one 
of the p(k) are equal to zero. In which case the one different from zero 
must be equal to unity. In other words, H„ vanishes only when the choice 
which will be selected Is known a priori with unity probability. In this 
Instance, It Is Intuitively clear that no Information Is being transmitted. 

On the other hand, H_N is a maximum (as shown in Appendix I) when all the 
p(k) are equal, that is, when there is no a priori knowledge at all about 
the selections. Under these circumstances, Eq.(22) reduces to Eq.(7), since 
p(k) = 1/N. The manner in which H_N varies with the probabilities of the 
choices is illustrated in Figure 3, for the particular case of N = 2. 
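
The behavior for N = 2 can be tabulated directly. The sketch below is an illustration only; the function name and the sample probabilities are assumptions, and the curve it prints is the standard two-choice form of Eq.(22).

```python
import math

def binary_information(p):
    """Average information per binary selection when one choice has probability p."""
    if p in (0.0, 1.0):
        return 0.0          # the outcome is known a priori
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

# Sample probabilities chosen for illustration.
for p in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    print(p, round(binary_information(p), 4))
```

The value is zero at p = 0 and p = 1, symmetric about p = 0.5, and largest (one binary unit) when the two choices are equally likely.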

The amount of information conveyed by a message of given length was 
defined above as the number of independent elementary (binary, equally likely) 
selections required, on the average, to specify such a message. The notion 
of a minimum number of binary selections required did not enter the 
definition. It should be intuitively clear, however, that the minimum number of 
binary selections required, on the average, to specify a message is equal 
to the average information conveyed, or, in other words, the number of 























Fig. 3 The amount of information per 
binary selection as a function of the 
probability of either choice. 


binary selections becomes a minimum when the selections are equally likely 
and independent. To prove this identity, we observe that the amount of 
information conveyed by a sequence of independent binary selections is a 
maximum when the selections are equally likely. Conversely, therefore, it 
is always possible to represent any sequence of m binary, not equally likely 
selections with a number of elementary selections smaller, on the average, 
than m. It follows that no binary representation of a message can be 
obtained with a number of selections smaller than the amount of information 
conveyed. It is clear, of course, that all message representations which 
employ independent, equally likely selections require, on the average, the same 
number of selections. It will be shown later that a larger number of 
selections is required whenever non-independent selections are used. 

It is appropriate to point out here that the mathematical form of 
Eq.(22) suggests a very interesting analogy between information and entropy, 
as expressed in statistical mechanics. In fact, H_N appears formally as the 
entropy of a system whose possible states have probabilities p(k). For a 
physical interpretation of this analogy, the reader is referred to the work 
of Norbert Wiener (Ref. 1). 


IV. Codes and Code Efficiency 

The preceding sections have been devoted to the definition of the unit 
of information and to the computation of the average amount of information 
per selection in the case of messages consisting of sequences of independent 
N-order selections. It was pointed out in Section III that H_N represents 
the minimum number of binary selections required, on the average, to perform 
an N-order selection with given choice probabilities. Therefore, if we take 
the number of binary selections employed as a basis for comparing different 
methods of conveying the same information, H_N represents a theoretical limit 
corresponding to maximum efficiency. 

The knowledge of such a theoretical limit is extremely important, but 
perhaps even more important is the ability to approach this limit in practice. 
In our case, fortunately, the procedure followed in computing H_N (that is, 
the theoretical limit) indicates a convenient method for approaching this 
limit in practice. Let us consider again all the sequences of n N-order 
selections (in which, however, n may be a small integer), and arrange them 
in order of increasing probability. If we wish to separate any one 
particular sequence from the others by means of successive division in almost 
equally probable groups, as discussed in the preceding section, the number 
of divisions required, on the average, that is, E(B_s), will be larger 
than nH_N. However, if we increase n, that is, the length of the sequences, 
we find that E(B_s)/n keeps decreasing and approaches H_N when n approaches 
infinity. It must be kept in mind, in this regard, that E(B_s)/n does not 
decrease necessarily in a monotonic manner, but may have an oscillatory 
behavior as a function of n.* It follows that an increase of n may actually 
produce an increase of E(B_s)/n. For instance (as shown in Figure 4), in the 
case of N = 2, p(0) = 0.7, p(1) = 0.3, the value of E(B_s)/n is 0.905 for 
n = 2, 0.909 for n = 3, and 0.895 for n = 4, the limiting value being 

    H_2 = 0.882. 
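
The successive-division procedure can be sketched as follows. This is an illustrative sketch under stated assumptions, not the author's exact rule: the sequences are sorted in decreasing probability, and each group is cut at the point that makes the two parts as nearly equiprobable as possible (the report itself notes that no general rule was found for doubtful cases). The function names are hypothetical.

```python
import itertools
import math

def division_counts(probs):
    """Number of binary divisions needed to isolate each sequence.

    probs must be sorted in decreasing order. Each group is cut where the
    two parts are most nearly equally probable (an assumed tie-free rule).
    """
    counts = [0] * len(probs)

    def divide(lo, hi):                 # half-open range [lo, hi)
        if hi - lo <= 1:
            return
        total = sum(probs[lo:hi])
        best_cut, best_gap, running = lo + 1, float("inf"), 0.0
        for cut in range(lo + 1, hi):
            running += probs[cut - 1]
            gap = abs(total - 2.0 * running)
            if gap < best_gap:
                best_cut, best_gap = cut, gap
        for i in range(lo, hi):
            counts[i] += 1              # one more division for every member
        divide(lo, best_cut)
        divide(best_cut, hi)

    divide(0, len(probs))
    return counts

def divisions_per_selection(choice_probs, n):
    """E(B_s)/n for sequences of n independent selections."""
    seq_probs = sorted(
        (math.prod(s) for s in itertools.product(choice_probs, repeat=n)),
        reverse=True,
    )
    counts = division_counts(seq_probs)
    return sum(p * c for p, c in zip(seq_probs, counts)) / n

for n in (1, 2, 3, 4):
    print(n, divisions_per_selection([0.7, 0.3], n))
```

For n = 2 the four sequences have probabilities 0.49, 0.21, 0.21 and 0.09 and require 1, 2, 3 and 3 divisions, which gives E(B_s)/n = 1.81/2 = 0.905, the figure quoted in the text. For larger n the result depends on how doubtful divisions are resolved, so the values printed by this sketch need not coincide with those of Figure 4.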

The above discussion indicates that, in transmitting a message consisting 
of a large number of selections, we should transmit the selections not 
individually, but in sequences of n as units, the number n being as large as 
permitted by practical considerations. The transmission of each of these 
units is then performed by means of sequences of binary selections 
corresponding in order to the successive divisions of the ensemble of all possible 
sequences of n N-order selections, as indicated in Figures 2, 4, and 5. It 
will be noted that, although the sequences of binary selections are not equal 
in length, it is always possible to identify the end of any of them in a long 
message. In fact, the first m selections of any sequence of length larger 
than m are always different from any of the sequences consisting of exactly 
m selections. 
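
The property just stated, that no sequence of binary selections is the beginning of a longer one, is what allows a receiver to find the end of each unit. A minimal sketch follows; the particular code table is an assumption (it is the set of division sequences obtained for n = 2, p(0) = 0.7, with hypothetical labels for the recoded units).

```python
def has_prefix_property(codewords):
    """True if no codeword is the beginning of another, different codeword."""
    return not any(
        a != b and b.startswith(a) for a in codewords for b in codewords
    )

def decode(bits, table):
    """Split a long run of binary selections back into the recoded units."""
    units, current = [], ""
    for bit in bits:
        current += bit
        if current in table:        # the end of a unit is recognized at once
            units.append(table[current])
            current = ""
    if current:
        raise ValueError("message ends in the middle of a unit")
    return units

# Assumed table: binary sequence -> pair of original binary selections.
table = {"0": "00", "10": "01", "110": "10", "111": "11"}

print(has_prefix_property(table))
print(decode("0101100111", table))
```

A table such as {"0", "01"} would fail the check, and a receiver could not tell whether "0" ends a unit or begins the longer one.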

If it is desired to perform the transmission by means of N'-order 
selections (N' being any integer), we can proceed in the same manner as in the 
case of binary selections, the only difference being that we must divide 
successively the ensemble of all possible sequences in N' groups instead of 
just two. After each division, the group containing the desired sequence 
will then be indicated by means of an N'-order selection. 

* This fact was first pointed out to me by L. G. Kraft of this Laboratory. 

The operation described above is, effectively, a change of code, that 
is, we may say, of the conventional language in which the message is written. 
Therefore this operation will be referred to as "message recoding". The 
advantage resulting from this recoding is conveniently expressed in terms 
of the code efficiency 

    \eta = \frac{H_N}{\log_2 N} ,    (25) 

that is, the ratio of the information transmitted on the average per 
selection, to the information which could be transmitted with an equally likely 
selection of the same order. The efficiency of a binary code resulting from 
the recoding of sequences of N-order selections can be computed most 
conveniently in the form 

    \eta = \frac{n H_N}{E(B_s)} ,    (26) 

where n is the number of N-order selections used in the recoding operation. 
Note that nH_N is the average amount of information per sequence of n N-order 
selections and E(B_s) represents the amount of information which could be 
transmitted, on the average, by one of the sequences of binary selections 
in which the original sequences are recoded, if these binary selections were 
equally likely. If the new code is of N' order, we must substitute for 
E(B_s) the product of log_2 N' by the number of N'-order selections required, 
on the average, to specify a sequence of n N-order selections. 
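
As a worked illustration of the efficiency ratio, the sketch below compares the unrecoded binary source p(0) = 0.7, p(1) = 0.3 with its recoding in pairs. The function names are hypothetical, and the division counts 1, 2, 3, 3 for the four pairs are an assumption taken from dividing the probabilities 0.49, 0.21, 0.21, 0.09 into nearly equiprobable groups.

```python
import math

def information_per_selection(probs):
    """Average information per selection, in binary units."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def efficiency_before_recoding(probs):
    """Information per selection over that of an equally likely selection."""
    return information_per_selection(probs) / math.log2(len(probs))

def efficiency_after_recoding(probs, n, seq_probs, seq_lengths, new_order=2):
    """n*H_N over log2(N') times the average number of N'-order selections."""
    average_length = sum(p * l for p, l in zip(seq_probs, seq_lengths))
    return n * information_per_selection(probs) / (
        math.log2(new_order) * average_length
    )

source = [0.7, 0.3]
pair_probs = [0.49, 0.21, 0.21, 0.09]
pair_lengths = [1, 2, 3, 3]          # assumed division counts for the pairs

print(efficiency_before_recoding(source))
print(efficiency_after_recoding(source, 2, pair_probs, pair_lengths))
```

Under these assumptions the efficiency rises from about 0.88 for the original selections to about 0.97 after recoding in pairs.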

A final remark must be made regarding the recoding operation. Since 
the process of successive divisions of an ensemble of sequences into equally 
probable groups cannot be carried out exactly, it is not clear at times 
whether one sequence should be included in one group or in another. Of 
course, we wish to perform all divisions in such a way as to obtain at the 
end the most efficient code. Unfortunately, no general rule could be found 
for determining at once how the divisions should be made in doubtful cases 
in order to obtain maximum code efficiency. However, so long as the divi¬ 
sions are made in a reasonable manner the resulting code efficiency will not 
differ appreciably from its maximum value. 

We have implicitly assumed in the foregoing discussion that we know 
a priori the probabilities p(k) of the choices for a message still to be 
transmitted. It seems appropriate at this point to discuss in some detail 
this assumption, since the practical value of the results obtained above 
depends entirely on its validity. When we state that the probability of a 
particular choice has a value p(k) we mean that the frequency of occurrence 
of that choice in a message originating from a given source is expected to 
be close to p(k). The longer the message, the closer we expect the 
frequency to approach p(k). It must be clear, however, that we have no 
assurance that the frequency of occurrence will not differ considerably from 
the probability even in the case of a very long message, although such a 
situation is very unlikely to arise. 

In practice, p(k) must be estimated experimentally following the reverse 
process, that is, by inference from the measurement of the frequency in a 
number of sample messages. If the frequencies in the sample messages are 
reasonably alike, or, more precisely, if their values are scattered in the 
manner which might be expected on the basis of the length of the messages 
used, we may feel relatively safe in taking their average value as a good 
estimate of the probability. In other words, we may expect that the 
frequency in any other message originating from the same source will be 
reasonably close to the average value obtained. If this is the case, the source 
of such messages is said to have a stationary statistical character. We can 
conceive the case, however, in which the frequencies in the sample messages 
available are so widely scattered that hardly any significance can be 
attributed to their average value. Such a result may mean that the source has not 
a stationary statistical character, at least for practical purposes, in which 
case the concept of probability loses any physical significance. Fortunately, 
however, the sources of interest appear to have a stationary character for 
any practical purpose. In addition, the estimates of the probabilities of 
the choices do not need to be too close. It should be clear, in this respect, 
that the fact that a code has been designed for a particular set of choice 
probabilities does not mean that only messages with the same statistical 
character can be transmitted. It means only that such a code will transmit 
most efficiently, that is, with the smallest number of selections, messages 
with the choice frequencies equal to the assumed probabilities. Moreover, 
we can expect that the efficiency of transmission will not depend in a 
critical manner on the actual frequencies of the messages to be transmitted. A 
proof that this is actually the case is given below. 

Suppose that a code which is optimum for a set of choice probabilities 
p'(k) is used to transmit messages with choice probabilities p(k). If we 
consider again all possible sequences of n selections, the expression for 
the number of binary selections required, on the average, to indicate one 
particular sequence, E(B_s), is still given by Eq.(13), where, however, the 
p(k) which appear in the form log_2 p(k) should be changed into p'(k). It 
follows that, in the limit when n approaches infinity, the number of binary 
selections per N-order selection will approach, according to Eq.(22), the 
value 

    H'_N = -\sum_{k=0}^{N-1} p(k) \log_2 p'(k) . 

It is clear from this equation that H'_N varies rather slowly with any one of 
the p'(k), unless the corresponding p(k) is close to zero or unity. H'_N is, 
of course, a minimum when p'(k) = p(k). The case of N = 2 is illustrated in 
Figure 6 for p(0) = 0.5 and p(0) = 0.7. We may conclude, therefore, that 
the statistical characteristics assumed a priori can be rather different from 
those of the messages actually transmitted, without the efficiency being 
lowered too much. 
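
The effect of designing a code for the wrong probabilities can be illustrated numerically. The sketch below is an assumption-laden illustration: it evaluates the limiting number of binary selections per selection as the expectation, under the actual probabilities p(k), of log2[1/p'(k)], which is the substitution described above; the function names and test values are hypothetical.

```python
import math

def selections_with_assumed_code(actual, assumed):
    """Limiting binary selections per selection when the code is optimum
    for `assumed` but the messages follow `actual`."""
    return -sum(p * math.log2(q) for p, q in zip(actual, assumed) if p > 0)

def information_per_selection(actual):
    return selections_with_assumed_code(actual, actual)

actual = [0.7, 0.3]
for assumed_p0 in (0.5, 0.6, 0.7, 0.8, 0.9):
    assumed = [assumed_p0, 1.0 - assumed_p0]
    print(assumed_p0, round(selections_with_assumed_code(actual, assumed), 4))
```

The smallest value occurs at p'(0) = p(0) = 0.7, and moving the assumed probability to 0.6 or 0.8 raises the result by only a few hundredths of a binary unit, in line with the conclusion drawn above.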



V. The Case of Non-Independent Selections 

Thus far we have been considering only messages of a particularly 
simple type, namely, messages consisting of sequences of independent 
selections. Obviously, the statistical character of most practical messages is 
much more complex. Any particular selection depends generally on a number 
of preceding selections. For instance, in a written message the probability 
that a certain letter will be an "h" is highest when the preceding letter 
is a "t". In a television signal the light intensity of a certain element 
of a scanning line depends very strongly on the light intensities of the 
corresponding elements in the preceding lines and in the preceding frames. 
In fact, the light intensity is very likely to be almost uniform over wide 
regions of the picture and to remain unchanged for several successive frames. 

The simplifying assumption that any one selection is independent of the 
preceding selections, although quite unrealistic, does not invalidate 
completely the results obtained in the preceding sections, but merely reduces 
their significance to that of first approximations. Intuitively, the average 
amount of information conveyed by a sequence of given length is decreased 
by the a priori knowledge of any correlation existing between successive 
selections. Therefore, the value given by Eq.(22) will always be larger 
than the correct value for the average amount of information per selection, 
and the same is true of the code efficiency given by Eq.(25). Similarly, 
any recoding operation performed in the manner discussed in Section IV will 
result in a higher efficiency of transmission, but not so high as could be 
obtained by taking into account the correlation between successive selections. 

The procedure for computing the average amount of information per 
selection and for recoding messages is still essentially the same as that used in 
Sections III and IV, even when the dependence of any selection on the 
preceding selections is taken into account. The only difference is that the 
probability of a particular sequence will not be equal simply to the product 
of the probabilities of the choices in it, since these are no longer 
independent. We must still arrange all the possible sequences of given length 
n in order of probability, and separate the desired sequence by successive 


possible. The number of divisions required, on the average, divided by the 
number n of selections will approach when n approaches infinity. 



H„(n) the average amount of information per sequence of n selections when 
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Expressing now Ho(n) in terms of the successive Increments, we obtain 


(33) 


average amount of information per selection 
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To proceed further in our analysis, we must distinguish 
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depends in the same manner on the m preceding selection of the same sub¬ 
Considering now in more detail the increments of information H_N(n+1), 
our intuition indicates that the average amount of information conveyed by 
any additional selection can be, at most, equal to the value obtained when 
the selection is independent of all preceding selections. Mathematically, 
it must be 

    H_N(n+1) \le -\sum_{k=0}^{N-1} p(k) \log_2 p(k) .    (35) 


A proof of this inequality is given in Appendix II. In addition, it is 
intuitively clear also that, in the case of uniform statistical character, 
the average amount of information conveyed by the (n+1)-th selection of a 
sequence can be, at most, equal to the amount of information conveyed by the 
n-th selection, since the latter has fewer preceding selections on which to 
depend. Mathematically, we expect that, for statistically uniform sequences, 

    H_N(n+1) \le H_N(n) .    (36) 

A proof of this inequality is given also in Appendix II. Eq.(36) is 
satisfied with the equal sign when the (n+1)-th selection, and therefore any 
following selection, depends only on the n-1 preceding selections. 

Eq.(36) shows that the limit in Eq.(34) is approached in a monotonic 
manner. In addition, we expect H_N(m) to approach monotonically a limit with 
increasing m, since the dependence of any selection on the preceding 
selections cannot extend, in practice, over an indefinitely large number of 
selections. Suppose, for instance, that this dependence extends only over the 
n_0-1 preceding selections. Then H_N(m) becomes constant and equal to H_N(n_0) 
when m is larger than n_0, and Eq.(34) yields 

    H_N = H_N(n_0) .    (37) 

This result is correct, of course, only in the case of statistically uniform 
sequences. 
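
The behavior of the increments of information can be illustrated with a statistically uniform binary source in which each selection depends only on the one preceding selection. Everything in the sketch below is an assumption made for illustration: the transition probabilities, the function names, and the use of the stationary distribution for the first selection.

```python
import itertools
import math

# Assumed conditional probabilities T[a][b] of choice b following choice a.
T = [[0.9, 0.1], [0.3, 0.7]]
# Stationary probabilities of the two choices for this T (0.75*0.9 + 0.25*0.3 = 0.75).
STATIONARY = [0.75, 0.25]

def sequence_probability(seq):
    p = STATIONARY[seq[0]]
    for a, b in zip(seq, seq[1:]):
        p *= T[a][b]
    return p

def information_per_sequence(n):
    """Average information conveyed by a whole sequence of n selections."""
    if n == 0:
        return 0.0
    total = 0.0
    for seq in itertools.product(range(2), repeat=n):
        p = sequence_probability(seq)
        if p > 0:
            total -= p * math.log2(p)
    return total

def increment(m):
    """Average information added by the m-th selection of a sequence."""
    return information_per_sequence(m) - information_per_sequence(m - 1)

for m in range(1, 6):
    print(m, round(increment(m), 6))
```

The first increment equals the value for independent selections (about 0.811 binary units here); the second is smaller (about 0.572) and all later increments remain equal to it, since the dependence extends over only one preceding selection.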

In the case of a periodically discontinuous statistical character, 
Eq.(36) is valid only when the n-th and the (n+1)-th selections belong to the 
same sub-sequence. If this is not the case, the (n+1)-th selection must be 
the first selection of a sub-sequence, and therefore is independent of all 
preceding selections. It follows that H_N(m) is a periodic function of m with 
period equal to the length n' of the sub-sequences, and that the limit of 
Eq.(34) is approached in an oscillatory manner. If we compute this limit by 
increasing n in steps equal to n', it is easily seen that Eq.(34) yields 



    H_N = \frac{1}{n'} \sum_{m=1}^{n'} H_N(m) ,    (38) 

a value larger than that given by Eq.(37), as was expected. 

The recoding procedure in the case of messages consisting of 
non-independent selections is still the same as in the case of independent selections. 
The efficiency of transmission, still given by Eq.(25), increases (although not 
necessarily monotonically) with the number of selections used as units in the 
recoding process, and approaches unity when the number increases indefinitely. 
It is worth emphasizing that in the recoding process any sequence, even if 
statistically uniform, is considered as periodically discontinuous. In fact, 
the groups of selections recoded as units are effectively sub-sequences 
which are treated as though they were totally unrelated. It follows that, 
if the recoding operation of a statistically uniform sequence is performed 
on groups of n selections, the efficiency of transmission after recoding 
can be at most equal to 


m=l 

In the case of statistically discontinuous sequences, it would seem 
reasonable to make the number of selections in the recoding groups an 
integral fraction or multiple of the length of the sub-sequences. 

A final remark is in order regarding the fitting of the recoding 
procedure to the statistical character of the messages to be transmitted. It 
may happen, as it does in the case of television signals, that the 
dependence of any one selection on the m-th preceding selection does not decrease 
monotonically when m increases, but behaves in an oscillatory manner. In 
this case, one should first reorder the selections before recoding, in such 
a manner that selections which are closely related take positions close to 
one another in the sequence. This idea of reordering the selections in the 
sequence can be generalized as follows. Any type of transmission of 
information can be considered as the transmission, in succession, of patterns in 
a two-dimensional or multi-dimensional space, time being one of the 
dimensions. Then the problem of ordering selections in an appropriate manner 



is clear, on the other hand, that such a scanning problem is also at the 
root of the problem of reducing the bandwidth required by television signals. 
The generalized scanning problem seems to be, therefore, of fundamental 
practical, as well as theoretical, importance. However, no work can yet be 
reported on this subject. 



or higher-order selections as possible. We did not consider, however, the 
physical process corresponding to such selections, that is, their 
transmission by electrical means. 











A convenient way of transmitting binary selections in a practical 
communication system is by means of pulses with two possible levels, one and 
zero. This is just the technique employed in pulse-code modulation. The 
maximum rate at which information can be transmitted in this case is simply 
equal to the number of pulses per second which can be handled by the 
electrical system, which we know to be proportional to the frequency band 
available. However, as soon as we start dealing with electrical pulses 
rather than logical operations like selections, an additional item must be 
considered in the problem, namely, the power required for the transmission. 
In the case of two-level pulses, the average power corresponding to the 
maximum rate of transmission of information is equal to one-half the pulse power, 
since the zero and one levels are equally probable. 

If pulses with N rather than two levels equally spaced in voltage are 
used, the maximum rate of transmission is equal to log_2 N times the number of 
pulses per second which can be handled by the system. The average power 
required becomes, in this case, 

    W = W_0 \frac{(N-1)(2N-1)}{6} ,    (40) 

where W_0 is the power corresponding to the lowest (non-zero) voltage level. 
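
The trade between rate and power can be tabulated with a short sketch. It assumes, as in Eq.(42), that the k-th level requires a power W_0 k^2 and that the N levels 0, 1, ..., N-1 are equally probable; the function names are hypothetical.

```python
import math
from fractions import Fraction

def average_power_ratio(levels):
    """W / W_0 for `levels` equally probable levels 0, 1, ..., levels-1,
    assuming the k-th level requires a power W_0 * k**2."""
    return Fraction(sum(k * k for k in range(levels)), levels)

def information_per_pulse(levels):
    """Binary units conveyed by one pulse with equally probable levels."""
    return math.log2(levels)

for levels in (2, 3, 4, 8, 16):
    print(levels, information_per_pulse(levels), float(average_power_ratio(levels)))
```

For two levels the ratio is one-half, as stated above for two-level pulses. Going from 2 to 16 levels multiplies the information per pulse by four while the average power grows by a factor of about 150 under these assumptions.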

The theoretical limit stated above for the rate of transmission of 
information certainly has practical significance when the limiting factors 
in the physical problem are the frequency band available and the number of 
pulse levels permitted by technical and economical considerations. It is 
to be noted, in this regard, that the effect of noise is here taken into 
account, to a first approximation, by setting a lower limit to the voltage 
difference between pulse levels, and therefore to W_0. For a detailed 
discussion of the effect of noise, the reader is referred to the work of 
Shannon (5). 

Eq.(40) shows, on the other hand, that the average power increases 
approximately as N^2, while the rate of transmission is proportional only to 
log_2 N. It follows that, if no limitation is placed on the frequency band 
employed, the smallest value of N should be used, that is, two. This value 
has, in addition, the very important practical advantage that the receiver 
is not required to measure a pulse, but only to detect the existence or the 
lack of a pulse. It might happen, on the other hand, that the frequency 
band and the average power are the limiting factors, while any reasonable 
number of pulse levels can be allowed. This case represents quite a 
different problem from those considered above, and the maximum rate of 
transmission of information is no longer obtained by making the pulse levels 
(that is, the choices) equally probable, as one might think at first. For 
example, more than one unit of information per pulse can be transmitted with 
an average power W = W_0/2, by using pulses with three levels not equally 
probable. It seems worth while, therefore, to determine the maximum amount 
of information which can be transmitted per pulse, for a given average power 
W, a minimum level power W_0, and an unlimited number of pulse levels equally 
spaced in voltage. 

Let, therefore, p(0) be the probability of occurrence of the zero level 
(no pulse), and p(k) the probability of the k-th level. The amount of 
information per pulse is given by 

    H = -\sum_{k=0}^{\infty} p(k) \log_2 p(k) ,    (41) 

and the average power by 

    W = W_0 \sum_{k=0}^{\infty} p(k) k^2 .    (42) 

We wish to maximize H with respect to the p(k), subject to the condition 
expressed by Eq.(42) and, of course, the usual condition 

    \sum_{k=0}^{\infty} p(k) = 1 .    (43) 

The maximization procedure is carried out in Appendix III, and yields 




(44) 

(45) 


The values of p(1)/p(0) and p(0) are plotted in Figure 7 as functions 
of W/W_0. The value of H is plotted as a function of the same variable 
in Figure 8. The latter curve shows, for instance, that the maximum amount 
of information per pulse for W = W_0/2 is 1.14, that is, 14 per cent higher 
than the value obtained by using two equally probable levels. 
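
The constrained maximum can also be approached numerically. The sketch below is not the derivation of Appendix III; it assumes that the maximizing probabilities have the form p(k) proportional to r^(k^2), which is what Lagrange's method gives for a constraint on the mean of k^2, truncates the unlimited set of levels at a finite number, and finds r by bisection. All names and the truncation are assumptions.

```python
import math

LEVELS = 60   # assumed truncation of the (in theory unlimited) set of levels

def distribution(r):
    """Probabilities proportional to r**(k*k) for k = 0, 1, ..., LEVELS-1."""
    weights = [r ** (k * k) for k in range(LEVELS)]
    total = sum(weights)
    return [w / total for w in weights]

def power_ratio(probs):
    """Average power divided by W_0, with the k-th level requiring W_0*k**2."""
    return sum(p * k * k for k, p in enumerate(probs))

def information(probs):
    """Information per pulse, in binary units."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def best_distribution(target_ratio):
    """Bisect on r until the average power matches the target W/W_0."""
    lo, hi = 1e-9, 1.0 - 1e-9
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if power_ratio(distribution(mid)) < target_ratio:
            lo = mid
        else:
            hi = mid
    return distribution((lo + hi) / 2.0)

probs = best_distribution(0.5)
print("information per pulse:", information(probs))
print("p(0):", probs[0], " p(1)/p(0):", probs[1] / probs[0])
```

For W = W_0/2 this sketch should give a little more than 1.14 binary units per pulse, against exactly one unit for two equally probable levels at the same average power, in agreement with the figure read from the curve.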

The procedure for approaching in practice the theoretical limit obtained 
above by appropriate recoding of the messages is very similar to that 
discussed in Section IV. It differs only in that the ensemble of all sequences 
of given length must now be divided in groups with probabilities p(0), p(1), 
..., p(k), ..., instead of in equally probable groups. The number of pulse levels 
to be used in practice (it should be infinite in theory) must be selected 
Fig. 7 Behavior of p(1)/p(0) and p(0) as functions of W/W_0. 

Fig. 8 Maximum information per pulse, H_max, as a function of W/W_0. 


on a compromise basis, and the values of the p(k) must then be readjusted 
accordingly, to make 



In addition to the effect of limitations on the average power, another 
important practical consideration has been neglected in the preceding sec¬ 
tions. All the types of recoding procedures suggested, for approaching in 
practice the theoretical limits derived above, require the use of devices 
capable of storing the information for a certain length of time in both the 
transmitter and the receiver. Such storage devices are needed to stretch 
or compress the time scale according to the probability of the group of 
original selections being recoded for transmission. 

Satisfactory storage units are not yet available. In addition, even 
were they available, their use would undoubtedly add considerably to the 
complexity of communication systems. On the other hand, any substantial 
increase of transmission efficiency is fundamentally based on time 
stretching. In fact, since the logarithm of the probability of the choice or 
sequence of choices selected is a measure of the information conveyed by 
the selection (see p. 14), the time rate at which information is conveyed 
in actual signals may vary considerably with time. Even so, a communication 
system must be able to handle at any time the peak rate which may be present 
in the signal. It follows that any system not employing storage devices to 
stretch or compress the time scale is bound to have an efficiency lower than 
the ratio of the average rate to the peak rate at which information is fed 
to it. It is worth mentioning in this connection that in certain types of 
communications, such as telegraph and television, the input and output 
signals do not have inherently fixed time scales. This is the same as saying 
that such forms of communication inherently incorporate storage devices. 
In the case of the telegraph, the written messages at the input and at the 
output are effectively storage devices. In the case of television, the 
image to be televised and the cathode-ray tube perform the same function. 

Although no reduction of frequency band for a given noise level can 
be obtained without storage devices, appropriate coding may lead to some 
reduction of average power. This reduction can be obtained by assigning 
sequences of pulses requiring the smallest energy to the most probable 
messages, and vice versa. In the particular case of pulse-code modulation, 
for instance, this can be done as follows. We arrange all digit combinations 
in order of increasing amount of energy required and the sampling levels in 
order of decreasing probability. We assign then the digit combinations to 
the sampling levels in the resulting order. Such a coding method requires, 
however, more flexible coding and decoding units than those used in 
present-day systems. 
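
The assignment just described can be sketched in a few lines. The sketch assumes that the energy of a digit combination is proportional to the number of "1" pulses it contains, and it uses a hypothetical set of sampling-level probabilities; the function names are likewise illustrative.

```python
from itertools import product

def assign_combinations(level_probs, digits):
    """Give the lowest-energy digit combinations to the most probable levels.

    Energy is taken as the number of '1' pulses (an assumption)."""
    combinations = sorted(
        ("".join(bits) for bits in product("01", repeat=digits)),
        key=lambda c: (c.count("1"), c),
    )
    by_probability = sorted(range(len(level_probs)), key=lambda i: -level_probs[i])
    return {level: combinations[rank] for rank, level in enumerate(by_probability)}

def average_pulses(level_probs, assignment):
    """Average number of '1' pulses per sample, a stand-in for average energy."""
    return sum(p * assignment[i].count("1") for i, p in enumerate(level_probs))

# Hypothetical probabilities of eight sampling levels (they add up to one).
level_probs = [0.05, 0.10, 0.40, 0.25, 0.10, 0.05, 0.03, 0.02]

natural = {i: format(i, "03b") for i in range(8)}   # ordinary binary numbering
reassigned = assign_combinations(level_probs, 3)

print(average_pulses(level_probs, natural))
print(average_pulses(level_probs, reassigned))
```

With these assumed probabilities the average number of pulses per sample drops from about 1.32 with ordinary binary numbering to about 0.77 after reassignment, while the number of digits per sample is unchanged.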

Before concluding this section, it should be made clear that the 
improvement of transmission efficiency discussed above and the resulting 
possible reduction of bandwidth requirements for a given signal power have 
little to do with the bandwidth reduction obtained by means of the Vocoder 
or other similar schemes. The Vocoder (2), for instance, does not improve 
the efficiency of transmission, but achieves a reduction in bandwidth by 
eliminating that part of the speech signal which is not strictly necessary 
for the mere understanding of the words spoken. Obviously, the recoding 
of messages according to their statistical character and the elimination 
of unnecessary information represent fundamentally different but equally 
important contributions to the solution of the bandwidth-reduction problem. 

Appendix I 
Maximization of f(x) 

In determining the values of the x_k for which f(x), as given in Eq.(18), 
is a maximum, it is more convenient to operate on the function 



whose maxima and minima at non-singular points coincide with those of f(x). 
The x_k are the variables in the maximization process, but are subject to the 
constraint 

    \sum_k x_k = 1 .    (I-2) 


Using Lagrange's method, we equate to zero the partial derivatives, with 



x. 
where \lambda is a constant to be determined later. We obtain then N equations 
of the form 

it is clear that when n approaches infinity these equations can be satisfied 
only when x_k = p_k, in which case Eq.(I-2) is also satisfied. 
function f(x) is neither discontinuous nor a minimum 
so that the existence of a maximum at this point does 




Maximization of H_N 

The function H_N given by Eq.(22) must be maximized with respect to the 
p(k), which are, of course, 
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same method as above, we obtain N equations of the form 
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This set of equations can be satisfied only if all the 
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is a maximum when P„,(ijk) = p(k), the probability of the k^“ choice, that 


is, when the additional selection is Independent of 
Mathematically, we must maximize the function H„ 
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The solution of the N 
and (II-2), is clearly 


IH-I 




are constants to be determined later. We obtain then. 



5) 


equations of this type, together with Eqs.(II-2) 


\ - P^U) , 


(II-6) 


|x. = in P 



9 




(H-7) 


Therefore, the increment H_N(n+1) is a maximum for P_{n+1}(j;k) = p(k), 
since this is the only point at which a maximum can exist and a maximum must 
exist at some point. This result can also be stated in the form 

    H_N(n+1) \le H_N(1) ,    (II-8) 

where 

    H_N(1) = -\sum_{k=0}^{N-1} p(k) \log_2 p(k)    (II-9) 

is the average amount of information per selection, that is, the average 
increment of information, when each selection is independent of all 
preceding selections. 


Proof That H_N(n+1) \le H_N(n) 




Let us consider a sequence of n selections as consisting of a first 
selection followed by a sequence of n-1 selections. Let P_n(h;j) be the 
conditional probability that the selection of the h-th choice is followed by 
the selection of the j-th sequence from the N^{n-1} possible sequences of n-1 
selections. Let also P_{n+1}(h,j;k) be the conditional probability that the 
k-th choice is selected after the h-th choice and the j-th sequence. We shall 
still indicate with p(k) the probability of the k-th choice and, similarly, 
with p(h) the probability of the h-th choice. Using these new symbols, 
Eq.(II-1) becomes 

    H_N(n+1) = -\sum_{h=0}^{N-1} \sum_{j=0}^{N^{n-1}-1} \sum_{k=0}^{N-1} p(h) P_n(h;j) P_{n+1}(h,j;k) \log_2 P_{n+1}(h,j;k) .    (II-10) 


We wish to show that, for a statistically uniform sequence, H_N(n+1) is a 
maximum when P_{n+1}(h,j;k) is independent of h. Mathematically, we must again 
maximize the function H_N(n+1) with respect to the N^{n+1} variables 
P_{n+1}(h,j;k), subject to the conditions 

    \sum_{k=0}^{N-1} P_{n+1}(h,j;k) = 1    (II-11) 

and 

    (II-12) 

where P_n(j;k) is the conditional probability that the k-th choice will be 
selected after the j-th sequence of n-1 selections, and must, in turn, satisfy 
the condition 

    (II-13) 
which, however, does not concern us, since it does not involve directly the 
P_{n+1}(h,j;k). It must be clear, on the other hand, that the P_n(j;k) are kept 
constant in the maximization process. In other words, the dependence of the 
(n+1)-th selection on the n-1 preceding selections is fixed in this case, while 
in the case discussed previously it was allowed to vary. In addition, since 
we are dealing with a statistically uniform sequence, the (n+1)-th selection 
depends on the n-1 preceding selections in the same manner as the n-th 
selection depends on its n-1 preceding, that is, on all the preceding selections. 


Proceeding in the same manner as in the proof that $H_N(n+1) \le H_N(1)$,
we find that, for given $P_n(j;k)$, the $P_{n+1}(h,j;k)$ make $H_N(n+1)$ a
maximum when they are independent of $h$, that is, of the first selection of
the sequence. Mathematically speaking, the maximum occurs when
$P_{n+1}(h,j;k) = P_n(j;k)$. It follows that Eq.(II-10) yields, with the help
of Eq.(II-11),

$$\left[ H_N(n+1) \right]_{\max} = -\sum_{j=0}^{N^{n-1}-1} \sum_{k=0}^{N-1} P_{n-1}(j)\, P_n(j;k) \log_2 P_n(j;k) = H_N(n).$$

This result can also be stated in the form

$$H_N(n+1) \le H_N(n).$$
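
The maximization step can be written out explicitly. The following is a sketch
under stated assumptions rather than the report's own derivation: the
multipliers $\alpha_{hj}$ and $\beta_{jk}$ and the symbols $A_{hj}$, $B_{jk}$
and $Q(j;k)$ are introduced here for illustration, natural logarithms are used
(a change of base only rescales the multipliers), all products
$p(h) P_n(h;j)$ are assumed to be non-zero, and the second constraint is taken
in the form $\sum_h p(h) P_n(h;j) P_{n+1}(h,j;k) = P_{n-1}(j) P_n(j;k)$.

```latex
% Lagrangian of the constrained problem; \alpha_{hj}, \beta_{jk} are assumed symbols.
\begin{align*}
L &= -\sum_{h,j,k} p(h) P_n(h;j) P_{n+1}(h,j;k) \ln P_{n+1}(h,j;k)
     + \sum_{h,j} \alpha_{hj} \sum_{k} P_{n+1}(h,j;k) \\
  &\quad + \sum_{j,k} \beta_{jk} \sum_{h} p(h) P_n(h;j) P_{n+1}(h,j;k), \\
0 &= \frac{\partial L}{\partial P_{n+1}(h,j;k)}
   = -p(h) P_n(h;j) \left[ \ln P_{n+1}(h,j;k) + 1 \right]
     + \alpha_{hj} + \beta_{jk}\, p(h) P_n(h;j), \\
\ln P_{n+1}(h,j;k) &= \beta_{jk} - 1 + \frac{\alpha_{hj}}{p(h) P_n(h;j)} .
\end{align*}
% The solution factors as P_{n+1}(h,j;k) = A_{hj} B_{jk}.
% Normalization over k gives A_{hj} = 1 / \sum_k B_{jk}, which does not depend on h,
% so P_{n+1}(h,j;k) = Q(j;k) for some function Q of j and k only.
% For a statistically uniform sequence, \sum_h p(h) P_n(h;j) = P_{n-1}(j), hence
\begin{align*}
Q(j;k) \sum_{h} p(h) P_n(h;j) = P_{n-1}(j) P_n(j;k)
\quad\Longrightarrow\quad Q(j;k) = P_n(j;k).
\end{align*}
```

Since the function being maximized is concave in the $P_{n+1}(h,j;k)$ and the
constraints are linear, this stationary point is the maximum.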




It must be clear that, in the case of non-statistically uniform sequences,
$P_n(j;k)$ may be an entirely different function from that representing the
dependence of the $n$th selection on the first $n-1$ selections of the
sequence, since, for instance, the $(n+1)$th selection can be entirely
independent of the preceding selections while the $n$th selection is not. It
follows, in this latter case, that Eq.(II-14) is not valid, and $H_N(n+1)$ can
be as large as $H_N(1)$.
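
The ordering of the increments for a statistically uniform sequence can be
checked numerically. The sketch below is an illustration under assumptions not
found in the report: it uses a hypothetical two-choice source in which each
selection depends only on the one before it, started from its stationary
probabilities so that the sequence is statistically uniform. The names
`sequence_probs` and `increment_bits`, and the numbers in `T` and `pi`, are
invented for this example.

```python
import itertools
import math

def sequence_probs(pi, T, n):
    """Probability of every sequence of n selections (first-order dependence)."""
    probs = {}
    for seq in itertools.product(range(len(pi)), repeat=n):
        p = pi[seq[0]]
        for a, b in zip(seq, seq[1:]):
            p *= T[a][b]
        probs[seq] = p
    return probs

def increment_bits(pi, T, n):
    """H_N(n) = -sum_j sum_k P_{n-1}(j) P_n(j;k) log2 P_n(j;k)."""
    joint = sequence_probs(pi, T, n)
    # P_{n-1}(j): probability of the first n-1 selections of each sequence.
    prefix = {}
    for seq, p in joint.items():
        prefix[seq[:-1]] = prefix.get(seq[:-1], 0.0) + p
    # joint[seq] = P_{n-1}(j) P_n(j;k), and P_n(j;k) = joint[seq] / prefix[j].
    return -sum(p * math.log2(p / prefix[seq[:-1]])
                for seq, p in joint.items() if p > 0)

T = [[0.9, 0.1], [0.4, 0.6]]   # hypothetical conditional probabilities
pi = [0.8, 0.2]                # stationary probabilities of T
h = [increment_bits(pi, T, n) for n in (1, 2, 3)]
print(h)
```

For this source the increments for n = 2 and n = 3 coincide, because a
selection depends only on its immediate predecessor; both lie below the
increment for n = 1.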


We wish to maximize the average amount of information per pulse, $H$, for
a given average power and an unlimited number of pulse levels equally spaced
in voltage. Mathematically, this amounts to maximizing the function given
by Eq.(41), subject to the conditions imposed by Eqs.(42) and (43). Following
Lagrange's method, as in Appendices I and II, we obtain an infinite set
of equations of the form






$$1 + \ln p(k) = \lambda + k^2 \mu, \tag{III-1}$$

where $\lambda$ and $\mu$ are indeterminate constants. The first of these
constants, $\lambda$, can be eliminated by subtracting the equation with $k=0$
from all the other equations of the set, which then take the form

$$\ln p(k) - \ln p(0) = k^2 \mu. \tag{III-2}$$

The remaining constant, $\mu$, is then eliminated by subtracting $k^2$ times
Eq.(III-2), with $k=1$, from the other equations of the same set. We obtain
in this manner a set of equations of the form

$$[\ln p(k) - \ln p(0)] - k^2 [\ln p(1) - \ln p(0)] = 0. \tag{III-3}$$

It follows that

$$p(k) = p(0) \left[ \frac{p(1)}{p(0)} \right]^{k^2} \tag{III-4}$$

and the conditions imposed on the $p(k)$ can now be written in the forms


00 




kal 




k 


W 




(III-5) 


(III-6)


The values of p(l)/p(0) are plotted in Figure 7 as functions of W/W . From 
these values, the p(k) are immediately obtained by means of Eq.(III-4). 
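
As a numerical illustration, the level probabilities can be built from a chosen
ratio r = p(1)/p(0) using p(k) = p(0) r^(k^2), the form that follows from
Eq.(III-3). This is a sketch under stated assumptions, not the report's
procedure: it assumes the levels are indexed k = 0, 1, 2, ... (non-negative
only), that the power of level k is proportional to k^2, and it truncates the
infinite set at a hypothetical `k_max`. The function name `pulse_level_probs`
is introduced here for illustration.

```python
import math

def pulse_level_probs(r, k_max=20):
    """p(k) = p(0) * r**(k*k) for k = 0..k_max, normalised to sum to 1.

    Assumption: levels are k = 0, 1, 2, ...; k_max must be large enough
    that the omitted terms are negligible (more terms are needed as r -> 1).
    """
    if not 0.0 < r < 1.0:
        raise ValueError("r = p(1)/p(0) must lie strictly between 0 and 1")
    weights = [r ** (k * k) for k in range(k_max + 1)]
    total = sum(weights)
    return [w / total for w in weights]

p = pulse_level_probs(0.5)
# Mean square level, proportional to average power under the assumption above.
mean_square_level = sum(k * k * q for k, q in enumerate(p))
# Average information per pulse in natural units for this distribution.
entropy_nats = -sum(q * math.log(q) for q in p if q > 0)
# Residual of Eq.(III-3) for the first few levels; should vanish to rounding.
residuals = [
    (math.log(p[k]) - math.log(p[0])) - k * k * (math.log(p[1]) - math.log(p[0]))
    for k in range(6)
]
print(mean_square_level, entropy_nats)
```

Sweeping r between 0 and 1 and recording the mean square level traces out the
kind of relation between p(1)/p(0) and average power that the text describes,
subject to the level-set assumption stated above.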

The maximum value of the average amount of information $H$ can now be
obtained without difficulty by substituting for the $p(k)$ in Eq.(41) the
values determined above. We have then, after appropriate manipulation of


max 






(III-7)



Using now Eq.(III-5), we obtain finally
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